AWARDS ABSTRACT

The invention is a system failure monitoring method and apparatus which learns 
the symptom-fault mapping directly from training data. The invention first estimates the 
state of the system at discrete intervals in time. A feature vector is estimated from sets 
of successive windows of sensor data. A pattern recognition component then models 
the instantaneous estimate of the posterior class probability given the features. Finally, 
a hidden Markov model is used to take advantage of temporal context and estimate 
class probabilities conditioned on recent past history. In this hierarchical pattern of 
information flow, the time series data is transformed and mapped into a categorical 
representation (the fault classes) and integrated over time to enable robust decision- 
making. 
HIDDEN MARKOV MODELS FOR FAULT
DETECTION IN DYNAMIC SYSTEMS

BACKGROUND OF THE INVENTION

Origin of the Invention: 

The invention described herein was made in the performance 
of work under a NASA contract, and is subject to the provisions 
of Public Law 96-517 (35 USC 202) in which the contractor has
elected not to retain title.

Technical Field: 

The invention relates to system monitoring apparatus employ- 
ing intelligent classifiers such as neural networks responding to 
measured control inputs and system responses or symptoms causally 
related to the control inputs for classifying the current state of the 
system relative to its known failure modes. 

Background Art:

1 Introduction

Continuous monitoring of complex dynamic systems is an increas- 
ingly important issue in diverse areas such as nuclear plant safety, 
production line reliability, and medical health monitoring systems. 
Recent advances in both sensor technology and computational ca- 
pabilities have made on-line permanent monitoring much more fea- 
sible than it was in the past. 

Health monitoring of complex, dynamic systems is a basic re- 
quirement in many domains where safety, reliability and longevity 
of the system under study are considered critical. The system of 
interest might be a nuclear power plant, a large antenna system, 
a telecommunications network, or a human heart. Health moni- 
toring can involve a variety of tasks such as detection of abnormal 
conditions, identification of faulty components, or prediction of 
impending failures. The availability at low cost of highly sensitive 
sensor technology, data acquisition equipment, and VLSI compu- 
tational power, has made round-the-clock permanent monitoring 
an attractive alternative to the more traditional periodic manual 
inspection. 

The specification will focus on the problem of accurately
determining the state of the monitored system as a function of time.
In particular, it is assumed that a sequence of observed sampled
sensor readings $\mathbf{y}(t)$ is available at uniformly spaced
discrete time intervals; without loss of generality the sampling
interval is assumed to be 1. Each $\mathbf{y}(t)$ is a
$k$-dimensional measurement. Given a sequence of such sample
vectors, $\mathbf{y}(t), \mathbf{y}(t-1), \ldots, \mathbf{y}(0)$,
the task is to infer the current state of the system at time $t$.

It is assumed that the system must be in one, and only one,
of a finite set of $m$ states, $\omega_i$, $1 \le i \le m$, at any
time. Let $\Omega$ be the discrete random variable corresponding to
the (unobservable) state of the system, taking values in the set
$\{\omega_1, \ldots, \omega_m\}$. Note that the words "states" and
"classes" will both be used in this specification but refer to the
same thing. One of these states is deemed "normal"; the other
$m - 1$ correspond to fault conditions. This assumption, that the
known fault classes are mutually exclusive and exhaustive, limits
the proposed method to problems where only single faults occur at
any given time and all faults can be described in advance. The first
limitation, single fault detection, is a known limitation of most
fault detection methods and is inherent in the underlying nature of
the sensor information available and the nature of the faults
themselves. For example, it is possible that in some problems,
multiple faults result in predictable combinations of single fault
symptoms; however, this is usually a domain-specific issue and is
beyond the scope of discussion in this specification. In practice,
since faults are often relatively rare compared to the sampling
interval at which decisions are made, the probability of two
independent faults occurring within the same time interval is
extremely small. It will be shown below that the second limitation,
the assumption that the known faults
$\{\omega_2, \ldots, \omega_m\}$ comprise the set of all faults
which could potentially occur, can be relaxed in a general
domain-independent manner. It is also assumed throughout that the
monitoring process of the invention is entirely passive and cannot
effect any changes in the system.

2 Background on Fault Detection for Dynamic Systems

In the typical dynamic system fault detection problem certain
signals are easily and directly measurable (the "sensors") while
others may be unobservable for various physical and practical
reasons. For some applications, direct statistical analysis of the
observed signals is sufficient to detect all faults of interest. For
example, it may be sufficient to detect a change in the mean value
of a time series. However, it is more typical that the observed
signals must be transformed in some manner in order to infer the
relevant fault information. In the ideal case where the system
dynamics and measurement process can be completely modelled in an
accurate manner, a variety of optimal control-theoretic methods for
fault detection can be derived using on-line state estimation and
statistical analysis of the residual error signals (see Willsky [1]
for an overview of such methods). FIG. 1 is a block diagram of this
method where $u(t)$ is the system input and $y(t)$ is the observed
system output.

In practice, however, particularly for large complex systems, it
is common to find that the system model may not be that reliable,
if indeed there is any system model available. A common technique
(Isermann [2], Frank [3]) is to fit a dynamic model to the
relationship between the measured input and output signals of the
system. In FIG. 1, $u(t)$ and $y(t)$ are the measured input and
output signals respectively, and $v(t)$ represents unmeasured
disturbances to the system.

The model is often a linear difference equation (in the discrete
time case) relating inputs and outputs, e.g.,

$$y(t) + \sum_{i=1}^{p} \alpha_i \, y(t-i) = \sum_{j=1}^{q} \beta_j \, u(t-j-\delta) + e(t) \tag{1}$$

where $e(t)$ is an additive noise term, $p$ and $q$ are the orders
of the model, and $\delta$ is a delay term. In this example the
observed data at time $t$ would be $\mathbf{y}(t) = \{u(t), y(t)\}$
and the model parameters would be denoted as
$\theta = \{\alpha_1, \ldots, \alpha_p, \beta_1, \ldots, \beta_q\}$.

Typically the order or structure of the model ($p$ and $q$) can be
judiciously estimated based upon known system properties; however,
the parameters $\theta$ of the model are estimated in an on-line
manner using observed input/output data. The lumped parameters of
the model can often be related to particular system components.
Hence, fault detection occurs by observing changes in the estimated
parameter values of the fitted model (compared with some model of
their normal condition), which in turn depend on the system
components. This method has become known as the parameter method of
fault detection: faults are detected by analyzing changes in the
parameters of the fitted model. How much the parameter vector needs
to change to be considered a real fault is the decision part of the
problem and is beyond the scope of this specification, as it is a
field for the application of statistical decision theory and
pattern recognition (Frank [3]).

The focus of this specification is on the problem of detecting
changes in the underlying system state from parameter estimates
$\theta(t), \theta(t-1), \ldots$ using both data-derived estimates
of the parameter-state dependence and prior knowledge of the
temporal behavior of the system. As mentioned earlier the system is
assumed to always be in one, but only one, state $\omega_i$,
$1 \le i \le m$, at any point in time, i.e., the states are mutually
exclusive and exhaustive. It is also assumed that the distribution
of parameters conditioned on a given state,
$p(\theta \mid \Omega = \omega_i)$ (where both are measured at the
same time $t$), is stationary, but that there may be some overlap of
these state-conditional distributions. This specification will refer
to the dependence $p(\theta \mid \Omega = \omega_i)$ as the
instantaneous model between the parameters and states. In the case
of complete overlap (where two or more states possess identical
distributions) there is naturally no way to identify the underlying
states just by observing the parameters and knowing the
instantaneous model. However, as will be shown later in this
specification, even when there is significant overlap in the
instantaneous model, accurate state identification is still possible
by taking temporal context into account using a hidden Markov
model.

It will be assumed herein that the application is such that a
database or fault library can be generated for both the normal
class and the fault classes $\{\omega_2, \ldots, \omega_m\}$. The
database consists of pairs of symptom vectors and class labels,
$\{\theta, \Omega(\theta)\}$, where $\theta$ is the $d$-dimensional
parameter vector estimated from the observed system data. Note that
the mapping from $\theta$ to $\Omega(\theta)$ need not be
one-to-one, since the conditional dependence of $\theta$ given that
$\Omega(\theta) = \omega_i$ is typically probabilistic in nature.

The assumption of availability of labelled training data rules 
out applications where it is not possible to gather such data — 
perhaps no such data has been collected in the past and it is not 
possible to simulate faults in a controlled manner. However, there 
are many applications where either a fault library already exists, 
or can be created under controlled conditions (perhaps by testing 
a particular system in a laboratory). The important point is that 
for fault diagnosis problems for which such symptom-fault data is 
readily available, standard supervised classification or discrimina- 
tion methods can be used to learn a fault diagnosis model from 
this database. 

It is important to note that the parameter estimation technique 
generally requires far less precise knowledge about the system than 
the previously-mentioned state-space approach and, hence, tends 
to be both more widely applicable and more robust from a practi- 
cal standpoint. For example, in the case of the antenna monitoring 
problem to be described later, both the presence of non-linearities 
and the inherent complexity of the system make it difficult to de- 
velop an accurate state-space model. In contrast, the parameter 
model method can be implemented with relative ease. Naturally, 
if there is enough knowledge of the system available such that the 
state-space approach is feasible, then this should give better results 
since it takes advantage of more information. 

As an aside, mention should also be made of knowledge-based
or artificial intelligence models which employ qualitative models of
system behavior to detect faults. First-generation knowledge-based
systems typically use experiential heuristics (described in the form
of expert-supplied rules) to describe symptom-fault relationships.
More sophisticated second-generation methods (under the broad
heading of "model-based reasoning") use qualitative causal models
of the system to represent "first-principles" knowledge (Bratko,
Mozetic and Lavrac [4] and Davis [5]).

In principle, this allows the system to identify faults which have
never occurred before. Both approaches have limited applicability
at present in terms of handling the dynamic and uncertain nature
of many real-world problems. In general, the qualitative symbolic
representation is not particularly robust for dealing with noisy,
continuous data containing temporal dependencies. Furthermore there
are many applications for which neither domain experts nor strong
causal models exist, thus making the development of a knowledge
base very difficult.


SUMMARY OF THE DISCLOSURE 

The present invention learns the symptom-fault mapping directly
from training data. The invention first estimates the state of the
system at discrete intervals in time. A feature vector $\theta$ of
dimension $k$ is estimated from sets of successive windows of sensor
data. A pattern recognition component then models the instantaneous
estimate of the posterior class probability given the features,
$p(\omega_i \mid \theta)$, $1 \le i \le m$. Finally, a hidden Markov
model is used to take advantage of temporal context and estimate
class probabilities conditioned on recent past history. In this
hierarchical pattern of information flow, the time series data is
transformed and mapped into a categorical representation (the fault
classes) and integrated over time to enable robust decision-making.
It is quite generic to systems which must passively sense and
monitor their environment in real-time.

The invention is a method of monitoring a system having a
normal working state corresponding to normal operation of the
system and a plurality of individual failure states corresponding to
different failure modes of the system, the system exhibiting
respective sets of measurable parameters including inputs and
behavior symptoms causally related to the inputs. The method begins
by defining plural transition probabilities for plural pairs of the
states, each transition probability being related to the probability
that the system will change from one to the other of the pairs of
states at any time. The method continues with observing a set of
actual values of the parameters in a current one of the sampling
intervals. From this, an instantaneous probability is obtained which
is an estimate of the probability of one of (a) the set of actual
values being observed and (b) the system being in the one state,
given the other of (a) and (b). Plural respective intermediate
probabilities are then computed corresponding to respective ones of
the states, each intermediate probability being equal to the
corresponding instantaneous probability of the one state multiplied
by a sum over plural states of the intermediate probability for a
given state computed during the previous sampling interval
multiplied by the transition probability between the given state and
the one state. Finally, a posterior probability that the system is
in one of the states given the sets of actual values observed over
the current and previous sampling intervals is computed for each
state from the intermediate probability of the current sampling
interval for the states. Whether the system is in a failure state is
determined by comparing the posterior probabilities of all the
states, and an indication thereof is issued.

In one embodiment, the instantaneous probability is an
instantaneous estimate of the probability that the system is in the
one state given the set of actual measurements, divided by an
unconditional probability of the system being in the one state. In
this embodiment, computing a posterior probability is performed by
equating the posterior probability with the intermediate probability
computed for the current sampling interval.

In another embodiment of the invention, the instantaneous
probability is a probability of the actual values of the current
sampling interval being observed given the system being in the one
state. In this latter embodiment, computing the posterior
probability is performed by dividing the intermediate probability by
an unconditional probability of observing the sets of actual values
of the current and previous sampling intervals.

In this latter embodiment, the instantaneous probability may
be obtained by first obtaining from a classifier responsive to the
parameters an instantaneous estimate of the probability that the
system is in the one state given the set of actual measurements,
and then transforming the classifier's instantaneous estimate to the
instantaneous probability using Bayes' rule. On the other hand, the
instantaneous probability may be obtained directly from a classifier
trained to output the instantaneous probability for each state in
response to the set of actual values.

Defining plural transition probabilities includes observing a
mean time between failures (MTBF) characteristic of each of the
failure states and computing each corresponding transition
probability therefrom. Computing the corresponding transition
probability includes dividing the time period of the sampling
intervals by the MTBF and subtracting the resulting quotient from
unity.
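
As a hedged illustration of the computation just described, the following sketch evaluates 1 - T/MTBF for a sampling interval T. The function name and the numerical values are hypothetical and are not taken from the specification; the only step drawn from the text is the division followed by subtraction from unity.

```python
def transition_probability_from_mtbf(sampling_interval, mtbf):
    """Return 1 - T/MTBF, with T and MTBF expressed in the same time units.

    Hypothetical helper: the name and the input checks are assumptions
    made for illustration only.
    """
    if sampling_interval <= 0 or mtbf <= 0:
        raise ValueError("sampling_interval and mtbf must be positive")
    if sampling_interval > mtbf:
        raise ValueError("sampling_interval must not exceed mtbf")
    return 1.0 - sampling_interval / mtbf


# Assumed example values: a 4-second sampling interval and a 400-second MTBF.
p = transition_probability_from_mtbf(4.0, 400.0)
```

With these assumed values the quotient T/MTBF is 0.01, so the computed probability is close to one, which reflects faults being rare relative to the sampling interval.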

Obtaining an instantaneous probability for each one of the
states includes observing the frequency of each failure state of the
system and the corresponding parameter values over a period of
time relatively long compared to the sampling intervals,
constructing a training data set associating the frequency of each
failure state with different sets of corresponding parameter values,
and using a classification algorithm operating on the training data
to infer from the parameter values observed during the current
sampling interval the instantaneous probabilities of the current
sampling interval.

The classification algorithm directly provides an instantaneous
probability for each one of the states that the system is in the
respective state given the set of parameter values observed during
the current sampling interval. Using the classification algorithm
includes transforming these instantaneous state probabilities to the
instantaneous probabilities of the observed values given each state
using Bayes' rule. It further requires, in one embodiment, training
a neural network on the set of training data, and then inputting the
parameter values of the current sampling interval to the neural
network while permitting the neural network to infer the
instantaneous probabilities of the current sampling interval.

In another embodiment, obtaining an instantaneous probabil- 
ity for a failure state is accomplished without training data related 
to that failure state and accomplished by determining for each 
parameter of that failure state upper and lower bounds on the pos- 
sible values thereof, and computing the instantaneous probability 
of that failure state from the upper and lower bounds. Computing 
of the instantaneous probabilities includes multiplying together all 
reciprocals of the differences between the upper and lower bounds 
of the parameters of that failure state. Preferably, in this embodi- 
ment, there are only two system states: a normal state and a failure 
state. 
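
The bound-based computation described above can be sketched as follows, assuming each parameter of the failure state is given as a (lower, upper) pair. The function name and the example bounds are hypothetical; the product of reciprocals of the bound differences is the step stated in the text.

```python
def density_from_bounds(lower_bounds, upper_bounds):
    """Multiply together the reciprocals of (upper - lower) for each parameter.

    Hypothetical helper for illustration; the specification states only the
    product of reciprocals, not this interface.
    """
    if len(lower_bounds) != len(upper_bounds):
        raise ValueError("bounds must have the same length")
    density = 1.0
    for lo, hi in zip(lower_bounds, upper_bounds):
        if hi <= lo:
            raise ValueError("each upper bound must exceed its lower bound")
        density *= 1.0 / (hi - lo)
    return density


# Assumed example: three parameters with ranges of width 2, 2 and 5.
d = density_from_bounds([0.0, -1.0, 10.0], [2.0, 1.0, 15.0])
```

Under these assumed bounds the result is 1/(2 * 2 * 5), i.e. a constant value over the bounded region that does not depend on any training data for that failure state.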

In a preferred implementation, observing the parameters in- 
cludes monitoring measurements of input commands and perfor- 
mance variables of the system and converting the measurements to 
parameters indicative of changes in the measurements. The param- 
eters can include autoregressive coefficients of the measurements, 
variances of the measurements and mean values of the measure- 
ments. 

The computing of the posterior probabilities from the inter- 
mediate probabilities includes, for the posterior probability of the 
observed set of parameter values given each state of the system, di- 
viding the intermediate probability of the corresponding state given 
the observed set of parameter values by a probability of observing 
the observed set of parameter values. 




BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a diagram illustrating a method of fault detection of 
the prior art. 


FIG. 2 is a block diagram of an apparatus embodying the
present invention, of which FIG. 2A illustrates an antenna pointing
system being monitored and FIG. 2B illustrates fault detection
apparatus embodying the invention.

FIG. 3 is a graph comparing estimates of probability of the true
class for normal conditions as a function of time obtained from the
neural-Markov embodiment of the invention and obtained with a
prior art neural network.

FIG. 4 is a graph of estimates of probability of the true class
corresponding to a compensation loss in the antenna pointing system
as a function of time obtained from the neural-Markov embodiment of
the invention and obtained with a Gaussian-Markov embodiment of the
invention.

FIGS. 5A, 5B and 5C are graphs of three separate contemporaneous
plots aligned vertically along the time axis of estimated
probabilities of three respective classes or states (corresponding to
the normal state, a tachometer fault and a compensation loss fault,
respectively) obtained simultaneously with a prior art neural
network, over a time interval during which the system is in the
three corresponding states one-at-a-time in succession.

FIGS. 6A, 6B and 6C are graphs of three separate contemporaneous
plots, aligned vertically along the horizontal time axis, of
estimated probabilities of the three states of FIG. 5A, respectively,
obtained simultaneously with the neural-Markov embodiment of
the present invention, over a time interval during which the system
is in the three states one-at-a-time in succession.


FIG. 7 is a diagram of a neural network employed in combina- 
tion with the invention. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS


3 Learning Symptom-Fault Mappings

This specification focuses on the use of the general parameter
estimation method. In particular, for the purposes of this
specification, the estimated parameters or "symptoms" of the system
correspond directly to the feature vector representation in a classic
pattern recognition model and are derived from the original
observable sensor data $\mathbf{y}(t)$. In turn, the system states
(normal and fault conditions) correspond to classes.

The details of the particular classification model used to
generate the symptom-fault mapping are not directly relevant to the
general discussion. If there is prior knowledge that the probability
dependence of the symptoms conditioned on the faults obeys a
particular parametric form, such as multi-variate Gaussian, then a
maximum-likelihood method to estimate the parameters of the
conditional distributions may be appropriate. More commonly there is
little prior knowledge regarding the symptom-fault dependencies.
In this case non-parametric discriminative methods such as linear
discriminants, nearest-neighbor (kNN) methods, decision trees, or
neural networks may all be useful approaches depending on the
exact nature of the problem at hand. Recent studies using several
well known data sets have shown that all of these classification
models perform roughly equally well in terms of predictive accuracy,
i.e., their classification performance on independent test data
sets was often statistically indistinguishable from each other (Ng
and Lippmann [6], Weiss and Kapouleas [7]). Hence, other attributes
of the classification method such as complexity, the ability
to handle high dimensional problems, small-sample performance,
explicit knowledge representation, and so forth, can become the
deciding factors for a given application.

One particular requirement is imposed on the classification
method to be used, namely that it produce estimates of the
posterior probabilities of the classes $\omega_i$, $1 \le i \le m$,
given the input symptoms $\theta$, i.e.,
$p(\Omega = \omega_i \mid \theta)$. In many practical applications
estimation of posterior probabilities (as opposed to a simple
indication of which class is most likely) is very useful to allow
one to control the false alarm rate, the rejection rate, and so
forth.
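
One way posterior probabilities can be used to trade off false alarms against rejections is a simple threshold rule, sketched below. The function name, the threshold value of 0.9 and the use of None to signal a rejected (undecided) sample are all assumptions made for illustration and are not part of the specification.

```python
def decide_with_reject(posteriors, threshold=0.9):
    """Return the index of the most probable class, or None if no class
    reaches the threshold (the sample is rejected as undecided).

    Hypothetical decision rule: raising the threshold lowers the false
    alarm rate at the cost of a higher rejection rate.
    """
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    if posteriors[best] >= threshold:
        return best
    return None


# Assumed example posteriors over three classes.
confident = decide_with_reject([0.05, 0.92, 0.03])
undecided = decide_with_reject([0.50, 0.30, 0.20])
```

A classifier that reported only the most likely class would not permit this kind of adjustment, which is the practical motivation for requiring probability estimates.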


Rather than deal with the time series data directly one usually
seeks to extract invariant characteristics of the time series
waveforms, where the invariance is with respect to different
environmental conditions of operation of the system conditioned on a
particular class. These invariant characteristics correspond
directly to the estimated system parameters discussed earlier, i.e.,
what are called system parameters in the control literature can be
treated as feature vectors for readers more familiar with pattern
recognition terminology. This feature extraction stage can
critically affect the classification performance of the overall
system. Note that the terms symptoms and features are used
interchangeably herein.

One feature extraction method is employed whereby the data is
windowed into separate consecutive blocks, each containing an
integer number $T$ of samples. Many variations of this sampling
scheme are possible, for example, the use of overlapping blocks or
recursive estimators. This specification is confined to the
relatively simple case of disjoint, consecutive blocks, each of
which contains $T$ samples. In practice $T$ is chosen to be large
enough to give reasonably accurate estimates of the features so as
to reduce the sampling variance across different windows. For
autoregressive models such as Equation (1), the $\theta$
coefficients are estimated from all of the observations in a given
window of consecutive samples using standard methods such as least
squares estimation, i.e.,

$$\begin{aligned}
\theta(t) &= f\big(\mathbf{y}(t), \mathbf{y}(t-1), \ldots, \mathbf{y}(t-(T-1))\big) \\
\theta(t-T) &= f\big(\mathbf{y}(t-T), \mathbf{y}(t-(T+1)), \ldots, \mathbf{y}(t-(2T-1))\big)
\end{aligned} \tag{2}$$

and so forth.
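
The windowing scheme can be sketched in a few lines, under simplifying assumptions that are not in the specification: a single scalar output signal, no input term, and a first-order autoregressive model y(t) + alpha_1 y(t-1) = e(t), for which the least squares estimate has a closed form. The function name and the choice of features (alpha_1, window mean, window variance) are hypothetical.

```python
def window_features(samples, T):
    """Split samples into disjoint consecutive blocks of T samples and return
    one feature tuple (alpha_1, mean, variance) per complete block.

    alpha_1 is the least squares estimate for the assumed model
    y(t) + alpha_1 * y(t-1) = e(t), fitted within the block only.
    """
    features = []
    for start in range(0, len(samples) - T + 1, T):
        block = samples[start:start + T]
        mean = sum(block) / T
        variance = sum((y - mean) ** 2 for y in block) / T
        num = sum(block[k] * block[k - 1] for k in range(1, T))
        den = sum(block[k - 1] ** 2 for k in range(1, T))
        # Guard against an all-zero block, where the estimate is undefined.
        alpha_1 = -num / den if den > 0 else 0.0
        features.append((alpha_1, mean, variance))
    return features


# Assumed noise-free test signal y(k) = 0.5**k, i.e. y(t) - 0.5*y(t-1) = 0.
signal = [0.5 ** k for k in range(8)]
feats = window_features(signal, T=4)
```

For this assumed signal both blocks yield alpha_1 = -0.5, so the feature is invariant across windows even though the raw samples decay, which is the property the feature extraction stage is meant to provide.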

What has been expressed at this point, assuming that a
particular estimation method and classification algorithm had been
chosen, is simply a framework for generating estimates of the state
of the system at any point in time, i.e., at intervals of time $T$
the classification system will produce estimates of the posterior
class probabilities given the features which are estimated over the
$[t, t-T]$ time interval. This approach makes an independent
decision at each time instant, i.e., class probability estimates or
symptom data from the past do not influence the present estimates.
Clearly this is suboptimal given the fact that faults are persistent
over time and, hence, that better class estimates could be obtained
by making use of past information. Two obvious approaches spring
to mind in order to model this temporal context. In the first, one
could introduce some form of memory into the classification model.
Examples of such memory methods include recurrent neural networks
(i.e., networks where the outputs are fed back to the inputs
after a unit delay, as in Pineda [8], Pearlmutter [9]) or a "window
in time" technique whereby the classifier is trained not only on
feature values at time $t$, but also on values from time $t - T$
back to $t - MT$ where $M$ is the memory of the model (Waibel et al.
[10]). This approach of implicitly modelling temporal context has
the significant disadvantage of making it much more difficult to
train the classifier. The second approach (which is now described),
of using a hidden Markov model, is much more elegant in that
it combines over time the instantaneous estimates of the trained
classifier by taking advantage of prior knowledge about the gross
statistical properties of the failure modes of the system.

4 Hidden Markov models for modelling temporal context

The use of discrete-time, finite-state, hidden Markov models for
smoothing classification decisions over time is now described. Note
that for the purposes of this discussion the terms "class" and
"state" are equivalent, i.e., both refer to the set of normal and
fault conditions $\{\omega_1, \ldots, \omega_m\}$.

A first-order temporal Markov model is characterized (in the
present context) by the assumption that

$$p(\Omega(t) = \omega_i \mid \Omega(t-T), \ldots, \Omega(0)) = p(\Omega(t) = \omega_i \mid \Omega(t-T)), \quad 1 \le i \le m, \tag{3}$$

for all $t$.

This means that the conditional probability of any current state
given knowledge of all previous states is the same as the conditional
probability of the current state given knowledge of the system state
at time $t - T$. Hence, assuming stationarity, to calculate the
probability of any state at time $t$, one need only know the initial
state probabilities
$\pi(0) = [p(\Omega(0) = \omega_1), p(\Omega(0) = \omega_2), \ldots, p(\Omega(0) = \omega_m)]$
and the values $p(\Omega(t) = \omega_i \mid \Omega(t-T) = \omega_j)$,
$1 \le i, j \le m$. The $m \times m$ matrix $A$, where
$a_{ij} = p(\Omega(t) = \omega_i \mid \Omega(t-T) = \omega_j)$,
is known as the transition matrix and characterizes the Markov
model. Given $A$ and $\pi$ one can calculate the probability of any
state at any time $t$.
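
The last statement can be made concrete with a short sketch. It assumes the convention that entry A[i][j] holds the probability of moving to state i from state j over one interval T, so each column of A sums to one and the state probabilities after n intervals are obtained by applying A to the initial probabilities n times. The function name and the two-state numerical values (including the choice of a fault state that is never left) are hypothetical.

```python
def propagate_state_probabilities(A, pi0, n_steps):
    """Return the state probabilities after n_steps intervals of length T.

    A[i][j] is assumed to be p(state i at t | state j at t - T), and pi0 is
    the vector of initial state probabilities.
    """
    m = len(pi0)
    pi = list(pi0)
    for _ in range(n_steps):
        pi = [sum(A[i][j] * pi[j] for j in range(m)) for i in range(m)]
    return pi


# Assumed two-state example: state 0 is normal, state 1 is a fault.
A = [[0.99, 0.0],
     [0.01, 1.0]]
pi0 = [1.0, 0.0]
pi2 = propagate_state_probabilities(A, pi0, 2)
```

With these assumed values the probability of still being in the normal state after two intervals is 0.99 squared, and the two entries of the result continue to sum to one.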

It is now assumed at this point of the discussion that the
discrete-time Markov model described above can be used to model
the failure behavior of the system of interest, i.e., at any time
$t$, given that the system is in a particular state $j$, the
probability that the system will be in state $i$ at time $t + T$ is
described by the state transition probability
$a_{ij} = p(\Omega(t) = \omega_i \mid \Omega(t-T) = \omega_j)$. The
implications of using such a model and the use of failure rates to
estimate the transition probabilities will be discussed below.
However, at this point the specification focuses on how the model is
used. Markov models such as this can be used for reliability
analyses to determine long-term failure rates and modes of a system
(Papazoglou and Gyftopoulos [11]).

However, the goal here is somewhat different, namely to monitor
the system in real-time. The key point is that the states of the
system are not directly observable, but are hidden, i.e., the
monitoring system has no direct way to measure the state of the
system, even for past time. Instead, various symptoms or features
$\theta(t)$ are observable. These features are a probabilistic
function of the states: in fact the classification models mentioned
earlier can estimate an instantaneous symptom-state mapping
$p(\Omega(t) = \omega_i \mid \theta(t))$. By making the appropriate
conditional independence assumptions, one can estimate
$p(\Omega(t) = \omega_i \mid \theta(t), \theta(t-T), \ldots, \theta(0))$
without explicitly providing the
$\theta(t-T), \ldots, \theta(0)$ as direct inputs to the classifier.

The hidden Markov formalism provides an exact solution to
this problem provided the underlying conditional independence
assumptions are met. It has been widely applied with significant
success in speech-recognition applications (Rabiner [12]). Let the
probability of the observed data be
$p(\Phi_t) = p(\theta(t), \ldots, \theta(0))$. It is convenient to
work in terms of an intermediate variable $\alpha_i$ where

$$\alpha_i(t) = p(\Omega(t) = \omega_i, \Phi_t). \tag{4}$$



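To find the posterior probabilities of interest it is sufficient to be
able to calculate the $\alpha$'s at any time $t$ since by Bayes' rule

$$p(\Omega(t) = \omega_i \mid \Phi_t) = \frac{\alpha_i(t)}{p(\Phi_t)}. \tag{5}$$

A recursive estimate is derived as follows:

$$\begin{aligned}
\alpha_i(t) &= \sum_{j=1}^{m} p\big(\Omega(t) = \omega_i, \Phi_t, \Omega(t-T) = \omega_j\big) \\
&= \sum_{j=1}^{m} p\big(\Omega(t) = \omega_i, \theta(t), \Phi_{t-T}, \Omega(t-T) = \omega_j\big) \\
&= \sum_{j=1}^{m} p\big(\Omega(t) = \omega_i, \theta(t) \mid \Phi_{t-T}, \Omega(t-T) = \omega_j\big)\, p\big(\Phi_{t-T}, \Omega(t-T) = \omega_j\big) \\
&= \sum_{j=1}^{m} p\big(\Omega(t) = \omega_i, \theta(t) \mid \Phi_{t-T}, \Omega(t-T) = \omega_j\big)\, \alpha_j(t-T) \\
&\qquad \text{(by the definition of } \alpha_j \text{)} \\
&= \sum_{j=1}^{m} p\big(\theta(t) \mid \Omega(t) = \omega_i, \Phi_{t-T}, \Omega(t-T) = \omega_j\big) \times p\big(\Omega(t) = \omega_i \mid \Phi_{t-T}, \Omega(t-T) = \omega_j\big)\, \alpha_j(t-T) \\
&= \sum_{j=1}^{m} p\big(\theta(t) \mid \Omega(t) = \omega_i\big)\, p\big(\Omega(t) = \omega_i \mid \Phi_{t-T}, \Omega(t-T) = \omega_j\big)\, \alpha_j(t-T) \\
&\qquad \text{(assuming that } \theta(t) \text{ is independent of past observations and past states, given the present state)} \\
&= \sum_{j=1}^{m} p\big(\theta(t) \mid \Omega(t) = \omega_i\big)\, p\big(\Omega(t) = \omega_i \mid \Omega(t-T) = \omega_j\big)\, \alpha_j(t-T) \\
&\qquad \text{(assuming that } \Omega(t) \text{ is independent of past observations given the past state } \Omega(t-T) \text{)} \\
&= p\big(\theta(t) \mid \Omega(t) = \omega_i\big) \sum_{j=1}^{m} a_{ij}\, \alpha_j(t-T)
\end{aligned} \tag{6}$$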

The first term can be derived from the classifier's estimate of
$p(\Omega(t) = \omega_i \mid \theta(t))$ and Bayes' rule. This estimate
provided by the classifier is referred to as the instantaneous
probability. (Alternatively, a classifier could be employed which
has been trained to provide instantaneous estimates of the first
term itself, namely an estimate of the probability for each state of
having made the actual observations, thus obviating the need to
invoke Bayes' rule.) The terms in the sum are just a linear
combination of the $\alpha$'s from the previous time-step. Hence,
Equation 6 gives the basic recursive relationship for estimating
state probabilities at any time $t$.
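
The recursion of Equation 6, followed by the normalization of
Equation 5, can be sketched in a few lines of Python. This is only an
illustrative sketch: the function names, the two-state example, the
transition values and the starting values of the $\alpha$'s are all
assumptions made here and do not come from the specification. The
matrix is stored so that `A[i][j]` is the probability of moving to
state i from state j, which matches the order of indices in the sum.

```python
def forward_step(alpha_prev, A, likelihood):
    """One step of Equation 6.

    alpha_prev[j] : alpha_j(t - T) from the previous time step
    A[i][j]       : a_ij = p(state i at t | state j at t - T)
    likelihood[i] : p(theta(t) | state i), the "first term"
    """
    m = len(alpha_prev)
    return [
        likelihood[i] * sum(A[i][j] * alpha_prev[j] for j in range(m))
        for i in range(m)
    ]


def posterior(alpha):
    """Equation 5: normalise the alphas into state probabilities."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Hypothetical two-state example: state 0 = normal, state 1 = fault.
A = [[0.99, 0.10],
     [0.01, 0.90]]            # each column sums to one
alpha = [0.95, 0.05]          # assumed starting values
for likelihood in ([0.2, 0.6], [0.1, 0.7], [0.1, 0.8]):
    alpha = forward_step(alpha, A, likelihood)
    print(posterior(alpha))
```

Because the unnormalized $\alpha$'s shrink at every step, a long-running
monitor would normally rescale them (for example by replacing them
with the posterior of Equation 5 after each step) to avoid numerical
underflow.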

From Equation (6), a more practical recursive estimate is derived
as follows: First, the term $p(\theta(t) \mid \Omega(t) = \omega_i)$ is
replaced by
$p(\Omega(t) = \omega_i \mid \theta(t)) / p(\Omega(t) = \omega_i)$
(where the denominator is the prior probability of state $i$ and is
estimated prior to operation in the standard manner). Second, the
$\alpha_j(t-T)$ terms are each replaced by
$p(\Omega(t-T) = \omega_j \mid \Phi_{t-T})$. These two substitutions
together are equivalent to dividing both sides of Equation 6 by
$p(\Phi_t)$ and give the equivalent recursive relation:

$$p(\Omega(t) = \omega_i \mid \Phi_t) = \frac{p(\Omega(t) = \omega_i \mid \theta(t))}{p(\Omega(t) = \omega_i)} \sum_{j=1}^{m} a_{ij}\, p(\Omega(t-T) = \omega_j \mid \Phi_{t-T})$$
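
A minimal sketch of this practical form of the recursion is given
below. The function name, the example transition matrix, the priors
and the classifier outputs are hypothetical. The sketch renormalizes
the result so that the state probabilities sum to one; that is a
choice made in this sketch, and it absorbs any factor that is common
to all states.

```python
def update_posterior(prev_post, A, inst_post, prior):
    """One step of the practical recursion.

    prev_post[j] : p(state j | data up to t - T)
    A[i][j]      : a_ij = p(state i at t | state j at t - T)
    inst_post[i] : classifier output p(state i | theta(t))
    prior[i]     : prior probability of state i
    """
    m = len(prev_post)
    unnorm = [
        (inst_post[i] / prior[i])
        * sum(A[i][j] * prev_post[j] for j in range(m))
        for i in range(m)
    ]
    total = sum(unnorm)                  # renormalise (sketch's choice)
    return [u / total for u in unnorm]


# Hypothetical example: a single "glitch" in which the classifier
# briefly favours the fault state (index 1).
A = [[0.999, 0.05],
     [0.001, 0.95]]
post = update_posterior([0.99, 0.01], A, [0.3, 0.7], [0.5, 0.5])
print(post)   # the normal state (index 0) still dominates
```

With a transition matrix that strongly favours staying in the normal
state, one window of contrary evidence is not enough to flip the
decision, which is the smoothing effect the Markov component is
intended to provide.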


The additional assumptions made in the derivation of Equation
6 (besides the first-order Markov assumption on state dependence)
require some comment. The first assumption is that $\theta(t)$ is
independent of both the most recent state and the observed past
data, given that the present state is known. This implies that the
observed symptoms are statistically independent from one time
window to the next, given the state information. For disjoint,
non-overlapping, blocks of data this will generally be true if the
feature sampling rate $1/T$ is greater than any significant frequency
components in the underlying observed time series. For overlapping
blocks of data, or where $T$ is comparable to the time constants of
the dynamic system, observed symptoms would no longer be
independent and the model would be modified to include a measure of
this dependence. The second assumption, that the present state
only depends on the previous state but not the past observations,
seems quite reasonable: there is no reason to expect that states in
the future depend on the actual observed data values in the past.

Note that the state probabilities are calculated here based on
past information. Alternative estimation strategies are possible.
For example, using the well-known forward-backward recurrence
relations (Rabiner [12]) one can update the state probability
estimates using symptom information which occurred later in time,
i.e., estimate
$p(\Omega(t) = \omega_i \mid \theta(t+kT), \ldots, \theta(t), \ldots, \theta(0))$.
From an operational standpoint this allows further smoothing of
glitches and a consequent reduction in false alarms; the
disadvantage is that there is a latency of time $kT$ before such an
estimate can be made. Another approach is to use the Viterbi
algorithm to estimate the most likely joint sequence of states, i.e.,

$$\max\; p(\Omega(t) = \omega_i, \ldots, \Omega(0) = \omega_j \mid \Phi_t).$$

Which scheme is used depends largely on the particular application
and each can easily be implemented using a variation of the
recursive equations derived above. The probability estimation
method based only on past and present measurements (as described
in Equations 5 and 6) is the most direct method for on-line
monitoring and will be assumed throughout the rest of the
specification.
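
For comparison with the on-line recursion, the Viterbi alternative
mentioned above can be sketched as follows. This is the textbook
Viterbi algorithm written in the log domain, not code taken from the
specification; the function name, the argument layout and the example
values are assumptions made for illustration.

```python
import math


def viterbi(likelihoods, A, initial):
    """Most likely joint state sequence.

    likelihoods[t][i] : p(theta at step t | state i)
    A[i][j]           : p(state i now | state j at the previous step)
    initial[i]        : p(state i at step 0)
    """
    def log(x):
        return math.log(x) if x > 0 else float("-inf")

    m = len(initial)
    score = [log(initial[i]) + log(likelihoods[0][i]) for i in range(m)]
    back = []                       # back-pointers, one list per step
    for obs in likelihoods[1:]:
        pointers, new_score = [], []
        for i in range(m):
            best_j = max(range(m), key=lambda j: score[j] + log(A[i][j]))
            pointers.append(best_j)
            new_score.append(score[best_j] + log(A[i][best_j]) + log(obs[i]))
        back.append(pointers)
        score = new_score

    # Trace the best path backwards from the best final state.
    state = max(range(m), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


A = [[0.99, 0.05],
     [0.01, 0.95]]
glitch = [[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1], [0.9, 0.1]]
print(viterbi(glitch, A, [0.99, 0.01]))
```

Unlike the forward recursion, the whole observation sequence must be
available before the path is traced back, which is the latency
trade-off noted in the text.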

5 The Nature of the Markov transition matrix

In the previous sections herein, the existence of the transition
matrix A has been assumed. The question naturally arises in practice
as to how the entries in this matrix are obtained. For speech
recognition applications there is typically an abundance of training
data from which A can be estimated by the use of iterative maximum
likelihood procedures such as the Baum-Welch algorithm. However,
for reliability monitoring, while there may be data obtained
under specific normal and fault conditions, there will typically not
be a set of training data corresponding to a sequence of state
transitions. Hence, in practice, prior knowledge regarding the
overall system reliability and behavior must be brought to bear in
order to provide estimates of A. The invention adopts a
divide-and-conquer approach by dividing the states into 3
categories: first is the normal state, then the intermittent states,
and finally the “hard-fault” states. The difference between the
latter two is that intermittent failures allow the possibility of
returning to the normal state whereas the “hard-fault” states do not.


5.1 Specification of the “normal-normal” transition probability, $a_{11}$


The use of a first-order Markov model to describe failure processes
implicitly assumes that the lengths of times between failures are
distributed geometrically. This follows from the fact that for a
discrete-time Markov model the probability that the system stays
in state $i$ for $n$ time steps is $p^{n-1}(1 - p)$ where
$p = a_{ii}$. The memoryless assumption which leads to the geometric
distribution of inter-failure durations is quite robust and
plausible for many applications and is widely used in reliability
analysis to model failure processes (Siewiorek and Swarz [13]).

By relating the Markov transition parameters to overall failure
statistics of the system, the invention can both check the validity
of the geometric distribution assumption and also determine the
transition probabilities themselves. The expected length $l$ of time
spent in state $\omega_1$, given that it starts in state $\omega_1$, is

\begin{align}
E[l] &= \sum_{n=1}^{\infty} n\, a_{11}^{\,n-1} (1 - a_{11}) \tag{7} \\
&= \frac{1 - a_{11}}{(1 - a_{11})^{2}} \tag{8} \\
&= \frac{1}{1 - a_{11}} \tag{9}
\end{align}

in units of time $T$. Thus, the mean time between failure (MTBF)
of the system can be expressed as

$$\frac{\text{MTBF}}{T} = \frac{1}{1 - a_{11}} \tag{10}$$

and, hence,

$$a_{11} = 1 - \frac{T}{\text{MTBF}} \tag{11}$$

where the MTBF and $T$ are expressed in the same time units. In
this manner, MTBF statistics can be used as the basis for
estimating $a_{11}$. The MTBF of the system can typically be either
specified by a reliability analysis (for a new system) or can be
estimated from a problem database (for a system which has been in
use for some time). Note that $T$ will be chosen to be much smaller
than the MTBF in practice.
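
As a quick check of Equations 7 to 11, the following sketch computes
$a_{11}$ from an MTBF and confirms numerically that the geometric
sum gives the expected dwell time $1/(1 - a_{11})$. The function
names and the example figures (a 4-second step and a 2-day MTBF) are
illustrative assumptions only.

```python
def self_transition_probability(mtbf, T):
    """Equation 11: a_11 = 1 - T / MTBF (same time units for both)."""
    if not 0 < T < mtbf:
        raise ValueError("T must be positive and smaller than the MTBF")
    return 1.0 - T / mtbf


def expected_dwell_steps(a_self, n_terms):
    """Truncated version of the sum in Equation 7."""
    return sum(n * a_self ** (n - 1) * (1.0 - a_self)
               for n in range(1, n_terms + 1))


# Hypothetical figures: decisions every 4 s, MTBF of 2 days.
a11 = self_transition_probability(mtbf=2 * 24 * 3600, T=4)
print(a11)

# Numerical check of Equations 7-9 for an easily summed case.
print(expected_dwell_steps(0.9, 2000), 1.0 / (1.0 - 0.9))
```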

5.2 Specification of the fault transition probabilities 

Transition probabilities into both intermittent and hard faults from
the normal state are found by weighting $1 - a_{11}$ (the probability
of the system entering a fault state at the next time step given
that it is currently in the normal state) by the anticipated relative
likelihood of occurrence of each fault state. These relative
likelihoods may be derived from reliability analyses or can be
estimated empirically if a problem database exists.

The mean anticipated duration of intermittent failures can be
used to calculate the self-transition probability for intermittent
states in an analogous manner to the way in which the MTBF
was used above to find $a_{11}$. Knowledge of intermittent fault
duration is typically more subjective in nature than finding the
MTBF and may require knowledge of the physics of the fault condition.

Conceptually, hard faults present a problem (in the context 
of Markov monitoring) since once such a fault occurs the system 
cannot return to the normal state until the fault is physically
repaired, which in turn typically requires downtime of the system.
In practice, a sensible approach is to define an “absorbing” state 
which indicates that the system has been halted. Hence, the only 
allowable transition out of a hard fault state is into the halt state. 
The length of time which the system may spend in the hard fault 
state, before the halt state is arrived at, is largely a function of the 
operational environment: if the Markov monitoring system itself 
is being used as part of an overall alarm system, or if the fault is 
detectable by other means, then an operator may shut down oper- 
ations quickly. On the other hand, if the fault does not manifest 
itself in any significant observable manner and if the Markov mon- 
itoring system is being used only for off-line data analysis, then 
the system may remain in the hard fault state for a lengthy period
of time. Hence, deciding how the self-transition probabilities are
chosen for the hard-fault classes will be quite specific to particular
operational environments.

To complete the Markov transition matrix it is sufficient to
note that “fault-to-fault” transitions are normally disallowed
except in cases where there is sufficient prior knowledge to believe
that intermittent faults can occur directly in sequence.
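
The construction described in this section can be summarized in a
short sketch. The function name, argument layout and numerical
values below are hypothetical. In particular, the sketch assumes
that the self-transition probability of a hard-fault state is
obtained from an anticipated time-before-halt in the same way as for
the other states, whereas the text leaves that choice to the
operational environment. The matrix is stored as `A[i][j]`, the
probability of moving to state i from state j, so each column sums
to one.

```python
def build_transition_matrix(T, mtbf, intermittent, hard):
    """Assemble A for: normal, intermittent faults, hard faults, halt.

    intermittent, hard : lists of (relative_likelihood, mean_duration)
    T, mtbf, durations : all in the same time units
    """
    n_int = len(intermittent)
    m = 1 + n_int + len(hard) + 1
    halt = m - 1
    A = [[0.0] * m for _ in range(m)]

    # Out of the normal state: stay with a_11, otherwise split
    # 1 - a_11 according to the relative fault likelihoods.
    a11 = 1.0 - T / mtbf
    A[0][0] = a11
    faults = intermittent + hard
    total_weight = sum(weight for weight, _ in faults)
    for k, (weight, _) in enumerate(faults, start=1):
        A[k][0] = (1.0 - a11) * weight / total_weight

    # Intermittent faults either persist or clear back to normal.
    for k, (_, duration) in enumerate(intermittent, start=1):
        A[k][k] = 1.0 - T / duration
        A[0][k] = T / duration

    # Hard faults either persist or lead to the absorbing halt state.
    for k, (_, duration) in enumerate(hard, start=1 + n_int):
        A[k][k] = 1.0 - T / duration
        A[halt][k] = T / duration

    A[halt][halt] = 1.0     # absorbing; no fault-to-fault transitions
    return A


A = build_transition_matrix(T=4.0, mtbf=172800.0,
                            intermittent=[(0.5, 40.0)],
                            hard=[(0.5, 400.0)])
for row in A:
    print(row)
```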

5.3 Comments on Robustness and Dynamics

The process of defining the Markov transition matrix is obviously 
quite subjective in nature. While this could be viewed as a weak- 
ness of the overall methodology, one can argue that in fact it is 
a strength. In particular, it allows the effective coupling of
relatively high-level prior knowledge (in the form of the Markov
transition matrix A) with the “lower-level” data-driven estimation of
$p(\Omega \mid \theta)$. Naturally, the latitude in specification of A
leads to questions regarding the sensitivity of the method to
misspecification.
While a systematic sensitivity study is beyond the scope of this 
specification, empirical results using this method suggest that un- 
less the parameter-state conditional densities are almost entirely 
overlapped, then the model is quite robust to variations in A — 
typically, only the length of time to switch between states (“time 
to detect” ) is directly affected. 

For a typically reliable system the dynamics of the Markov 
model will be such that it will remain in the normal state for long 
stretches of time. It is important to realize that the relatively static 
behavior of the model should not undermine the reader’s assess- 
ment of its practical utility: for many problems it is often extremely 
difficult to design detectors of rare events which have both a low 
false alarm rate and a high detection rate. For example, in the 
next section an application is described in which the system makes 
classification decisions every 6 seconds or so, while the MTBF is on
the order of a few days. For this application, if the Markov model 
component of the method is omitted and only the instantaneous 
state estimates are used, the false alarm rate increases dramatically 
to the extent that this non-Markov method would be completely 
impractical for use in an operational environment. 

ALTERNATIVE EMBODIMENT OF THE INVENTION


The problem of interest is that of detecting faults or changes 
in the observed characteristics of time series data which is being 
monitored on-line from a dynamic system. Problems which fall into 
this category include fault detection in large complex hardware
systems (such as nuclear power plants, chemical process plants, large
antenna systems) and biomedical monitoring of critical signals in
humans (such as pacemakers and so forth). If there exist good
models of (1) the system which is being monitored,
(2) any noise which might be present in the measurement process
and (3) the likely behavior of the system when a fault occurs, then
standard model-based techniques exist which can accurately detect
changes.

In practice however, particularly for large complex systems, 
there is often little prior knowledge available in the form of accu- 
rate models, rendering the model-based method ineffective. Hence, 
it is common in commercial products to use much simpler threshold 
alarm methods which trigger an alarm whenever a derived param- 
eter of interest (from the observed time series), or the amplitude of 
the time series itself, exceeds some pre-specified limit. The
problem with this approach is that it is likely to be very prone to
false alarms if noise is present and will not detect subtle changes
in the characteristics of the signal under observation.

The method described above to address the on-line fault de- 
tection problem uses a Hidden Markov model. The method is ex- 
tremely robust to false alarms, does not require a model of the 
system under normal or fault conditions, and can detect subtle 
changes in signal characteristics. The method makes the following 
assumptions: 

• A1: There is a known set of $m - 1$ mutually exclusive and
exhaustive faults, denoted as $\omega_2, \ldots, \omega_m$, where
$\omega_1$ denotes normal conditions.

• A2: Training data for both normal and fault conditions are
available which consists of time series sequences.

• A3: The observed time series data is stationary under both
normal and fault conditions.

• A4: Information about the mean time to failure for each fault
mode is available.

However, this method suffers from the significant disadvantage
of assumptions A1 and A2, namely that training data is required
for a prespecified set of faults. While data is usually easy to acquire
for normal conditions, it is often impractical to obtain data under
fault conditions.

In the alternative embodiment of the invention, assumptions
A1 and A2 can be replaced by a much less restrictive pair of
assumptions while still retaining the overall advantages of the
invention. The new assumptions are as follows:

• A1*: Training data under normal conditions is available.

• A2*: Physical limits can be placed on any parameters of
interest which can be derived from the time series.

Assumption A1* is trivial since it is difficult to imagine an
application where data under normal conditions cannot be obtained.
Assumption A2* essentially states that there must exist sufficient
prior knowledge about the observed parameters such that a density
function can be specified a priori on these parameters. The
role of this density function will now be explained.

The parameters of interest at time $t$ are denoted as a vector
$\theta(t)$. The parameters are typically statistical estimates of
some characteristic of the time series such as the mean, variance, or
auto-regressive (AR) coefficients. As discussed above, it is by
observing changes in these derived parameters that the HMM method
detects changes in the underlying time series (and, hence, the system
itself). The invention, as described above, requires probability
estimates of the form $p(\theta(t) \mid \omega_i(t))$,
$1 \le i \le m$, as a central part of the model. These in turn are
obtained by Bayes' rule from the estimates
$p(\omega_i(t) \mid \theta(t))$ which are learned from the training
data. Since the process is assumed to be stationary given $\omega$,
the reference to time $t$ can be dropped at this point.

In the alternative embodiment, the changes are as follows: 




1. For $\omega_1$ (normal conditions) calculate
$p(\theta \mid \omega_1)$ using either a parametric density or a
non-parametric density estimate where the density is fitted to the
available training data.

2. For $\omega_2$ (non-normal conditions), specify a prior density in
the form of $p_{\text{prior}}(\theta \mid \omega_2)$ where $\omega_2$
signifies non-normal conditions.

The first change is quite straightforward and merely requires that
a multi-variate density be fitted to the observed parameters;
standard techniques are available for this purpose. Alternatively,
if there is prior knowledge available (e.g., such that the
parameters obey a multi-variate Gaussian assumption under normal
conditions), this can also be used to specify the density directly.
The second change requires that
$p_{\text{prior}}(\theta \mid \omega_2)$ be available. If assumption
A2* holds, and in the absence of any other specific information
about the parameter behavior under fault conditions, one can specify
a uniform density for $p_{\text{prior}}(\theta \mid \omega_2)$ where
the ranges correspond to the physical limits on the parameters
specified in A2*. In practice these limits are usually available.
For example, the variance of the signal can be bounded based on the
overall energy available to the system; similarly, AR coefficients
must obey certain constraints if the underlying process is
stationary. The choice of the uniform density is the most
appropriate when there is no prior knowledge about the parameters
(other than the ranges); if prior knowledge is available, other
prior densities could be used.

Implementation of the Alternative Embodiment: The exact
changes required to implement the new method are now described:

1. Set up a 2-state hidden Markov model in accordance with the
foregoing description where $\omega_1$ corresponds to normal
conditions and $\omega_2$ is non-normal.

2. Obtain the transition probabilities for the Markov portion of
the model from fault duration data as described above.

3. Determine the functional form of $p(\theta \mid \omega_1)$ using
methods described above.

4. For each parameter $\theta_j$, $1 \le j \le P$ (where $P$ is the
number of parameters), specify upper and lower bounds, $a_j$ and
$b_j$ respectively, on the possible values which $\theta_j$ can take.

5. Specify the density $p_{\text{prior}}(\theta \mid \omega_2)$ as

$$p_{\text{prior}}(\theta \mid \omega_2) = \prod_{j=1}^{P} \frac{1}{a_j - b_j}$$

if there is no prior knowledge available other than the range
of parameter values and the density under normal conditions
($p(\theta \mid \omega_1)$). If prior knowledge is available then use
this information to specify $p_{\text{prior}}(\theta \mid \omega_2)$.

6. Perform the process of the invention as described above, except
that in Equation (6) the $p(\theta(t) \mid \omega_i(t))$ term is now
calculated as described in steps 3 and 5 above.
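
The six steps above can be sketched as follows. This is an
illustrative sketch only: it assumes a diagonal Gaussian for the
density under normal conditions (the text allows any parametric or
non-parametric density), and the function names, bounds and
transition values are hypothetical.

```python
import math


def gaussian_density(theta, mean, var):
    """Step 3 (assumed form): diagonal Gaussian for normal conditions."""
    log_p = 0.0
    for x, mu, v in zip(theta, mean, var):
        log_p += -0.5 * math.log(2 * math.pi * v) - (x - mu) ** 2 / (2 * v)
    return math.exp(log_p)


def uniform_density(theta, upper, lower):
    """Step 5: product of 1 / (a_j - b_j) inside the bounds, else 0."""
    p = 1.0
    for x, a, b in zip(theta, upper, lower):
        if not b <= x <= a:
            return 0.0
        p /= (a - b)
    return p


def two_state_update(prev_post, A, theta, mean, var, upper, lower):
    """Step 6: Equation 6 with the two densities, then normalisation.

    State 0 is normal, state 1 is non-normal.
    A[i][j] = p(state i now | state j at the previous step).
    """
    lik = [gaussian_density(theta, mean, var),
           uniform_density(theta, upper, lower)]
    unnorm = [lik[i] * sum(A[i][j] * prev_post[j] for j in range(2))
              for i in range(2)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


A = [[0.999, 0.05],
     [0.001, 0.95]]
post = [0.999, 0.001]
for theta in ([0.2], [4.0], [4.0], [4.0]):   # one parameter, P = 1
    post = two_state_update(post, A, theta, mean=[0.0], var=[1.0],
                            upper=[10.0], lower=[-10.0])
    print(theta, post)
```

A parameter value far from anything seen under normal conditions has
a very small Gaussian density, so the flat prior density wins after a
few consistent windows even though no fault data was used in
training.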

There are several possible extensions to the alternative embod- 
iment, including the use of on-line adaptation to improve the initial 
models and the incorporation of specific fault models in the case 
where such prior knowledge of fault behavior is available. These 
extensions are technically relatively straightforward given the un- 
derlying method as described here. 

The alternative embodiment requires fewer assumptions than 
the foregoing main embodiment while still retaining many of the 
advantages of the main embodiment. Implementation is quite
simple and has a very low computational complexity (order of
$P \cdot m$ calculations per time step). In addition, in the
alternative embodiment, setting up the model simply requires the
specification of some ranges on the parameters of interest and some
normal training data; hence, the method should be relatively robust
and could conceivably be used as part of an “off-the-shelf” product
by non-specialists. Given the simplicity and reliability of the method,
it would appear that it may have considerable practical utility for 
a wide variety of on-line monitoring applications. 

In the remainder of this specification, the description concerns 
the main embodiment of the invention. 


6 Background on Antenna Fault Diagnosis 

Application of the hidden Markov model to a real fault monitoring
problem is now described. It is first helpful to provide some
background. The Deep Space Network (DSN) (designed and operated
by the Jet Propulsion Laboratory for the National Aeronautics
and Space Administration (NASA)) provides end-to-end
telecommunication capabilities between earth and various
interplanetary spacecraft throughout the solar system. The ground
component of the DSN consists of three ground station complexes
located in California, Spain and Australia, giving full 24-hour
coverage for deep space communications. Since spacecraft are always
severely limited in terms of available transmitter power (for
example, each of the Voyager spacecraft uses only 20 watts to
transmit signals back to earth), all subsystems of the end-to-end
communications link (radio telemetry, coding, receivers, amplifiers)
tend to be pushed to the absolute limits of performance. The large
steerable ground antennas (70m and 34m dishes) represent critical
potential single points of failure in the network. In particular
there is only a single 70m antenna at each complex because of the
large cost and calibration effort involved in constructing and
operating a steerable antenna of that size; the entire structure
(including pedestal support) weighs over 8,000 tons.

The antenna pointing systems consist of azimuth and elevation
axes drives which respond to computer-generated trajectory
commands to steer the antenna in real-time. Pointing accuracy
requirements for the antenna are such that there is little tolerance
for component degradation. Achieving the necessary degree of
positional accuracy is rendered difficult by various non-linearities
in the gear and motor elements and environmental disturbances such
as gusts of wind affecting the antenna dish structure. Off-beam
pointing can result in rapid fall-off in signal-to-noise ratios and
consequent potential loss of irrecoverable scientific data from the
spacecraft.

The antenna servo pointing systems are a complex mix of
electro-mechanical components. FIG. 2A includes a simple block
diagram of the elevation pointing system for a 34m antenna; see
Appendix 2 for a brief description of how the pointing system works.
A faulty component manifests itself indirectly via a change in the
characteristics of observed sensor readings in the pointing control
loop. Because of the non-linearity and feedback present, direct
causal relationships between fault conditions and observed symptoms
can be difficult to establish; this makes manual fault diagnosis a
slow and expensive process. In addition, if a pointing problem occurs
while a spacecraft is being tracked, the antenna is often shut down
to prevent any potential damage to the structure and the track is
transferred to another antenna if possible. Hence, at present,
diagnosis often occurs after the fact, where the original fault
conditions may be difficult to replicate.

7 Experimental Results 

7.1 Data Collection and Feature Extraction 

The observable antenna data consists of various sensor readings (in
the form of sampled time series) which can be monitored while the
antenna is in tracking mode. To generate a fault library, hardware
faults were introduced in a controlled manner by switching faulty
components in and out of the control loop. Sensor variables
monitored included wind speed, motor currents, tachometer voltages,
estimated antenna position, and so forth.

The time series data was initially sampled at 50 Hz (well above
the estimated Nyquist sampling rate for signals of interest) and
segmented into windows of 4 seconds duration (200 samples) to allow
reasonably accurate estimates of the various features. The features
are derived by applying an autoregressive-exogenous (ARX) modelling
technique using the rate feedback command as the input to
the model and motor current as output, using the definitions
illustrated in FIG. 1:

$$y(t) + \sum_{i=1}^{p} a_i\, y(t-i) = \sum_{j=1}^{q} b_j\, u(t-j) + e(t), \qquad t = 1, 2, \ldots, N \tag{12}$$

where $y(t)$ is the motor current, $u(t)$ is the rate command input,
$e(t)$ is an additive white noise process, and $a_i$ and $b_j$ are the
model coefficients. The model order was chosen by finding an
empirical minimum (using data from normal conditions) of the Akaike
Information Criterion (AIC) which trades off goodness-of-fit to the
data with model complexity (Ljung [14]). An 8th order model was
chosen in this manner with $p = 6$ and $q = 2$, resulting in 8 ARX
features. Using this model structure, a separate set of ARX
coefficients was estimated from each successive 4-second window of
data using direct least mean squares estimation. Hence a new set of
features, $\theta(t)$, is available at a rate of 0.25 Hz compared to
the original sampling rate of 50 Hz; for this particular application
this rate of decision-making is more than adequate. The
autoregressive representation is particularly useful for
discriminative purposes when dealing with time series (Kashyap [15]).
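
A minimal sketch of per-window ARX estimation in the form of
Equation 12 is shown below. It solves the ordinary least-squares
normal equations with a small Gaussian-elimination routine so that it
needs only the standard library; the function names, the synthetic
first-order test signal and the use of the normal equations are
assumptions of this sketch, not details given in the text.

```python
import random


def solve_linear_system(M, v):
    """Gaussian elimination with partial pivoting (M is square)."""
    n = len(v)
    aug = [row[:] + [v[k]] for k, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def arx_features(y, u, p, q):
    """Least-squares estimate of [a_1..a_p, b_1..b_q] in Equation 12."""
    rows, targets = [], []
    for t in range(max(p, q), len(y)):
        # y(t) = -sum a_i y(t-i) + sum b_j u(t-j) + e(t)
        row = [-y[t - i] for i in range(1, p + 1)]
        row += [u[t - j] for j in range(1, q + 1)]
        rows.append(row)
        targets.append(y[t])
    n = p + q
    gram = [[sum(r[a] * r[b] for r in rows) for b in range(n)]
            for a in range(n)]
    rhs = [sum(r[a] * yt for r, yt in zip(rows, targets)) for a in range(n)]
    return solve_linear_system(gram, rhs)


def windowed_features(y, u, p, q, window=200):
    """One feature vector per disjoint window (200 samples = 4 s at 50 Hz)."""
    return [arx_features(y[s:s + window], u[s:s + window], p, q)
            for s in range(0, len(y) - window + 1, window)]


# Synthetic noise-free check: y(t) - 0.5 y(t-1) = u(t-1).
rng = random.Random(0)
u = [rng.uniform(-1.0, 1.0) for _ in range(400)]
y = [0.0]
for t in range(1, 400):
    y.append(0.5 * y[t - 1] + u[t - 1])
print(windowed_features(y, u, p=1, q=1))   # two windows, each near [-0.5, 1.0]
```

The sketch will fail with a division by zero if the input within a
window is not varied enough for the normal equations to be
solvable; a production implementation would need to guard against
that.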

In addition to the ARX features, there are four time domain 
features (such as the estimated standard deviations of tachometers 
and torque sensors) which were judged to have useful discriminative 
power. It is worth pointing out that for the chosen sample size of 
200 it was found that the assumption that feature estimates do not 
have any temporal dependence across windows was justified. This 
observation is based on empirical results obtained by analyzing the 
correlation structure in the training data. 
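
One simple way to examine temporal dependence across windows is the
lag-1 autocorrelation of each feature taken over successive windows.
The text does not say which statistic was actually used, so the
sketch below is only one plausible check, and the function name is
hypothetical.

```python
def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation of one feature across windows."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    cov = sum((values[k] - mean) * (values[k + 1] - mean)
              for k in range(n - 1))
    return cov / var


# A slowly drifting feature is strongly correlated across windows,
# which would violate the independence assumption.
drifting = [float(k) for k in range(100)]
print(lag1_autocorrelation(drifting))
```

Values close to zero for every feature would be consistent with the
independence assumption; values close to plus or minus one would
suggest that the window length needs to be reconsidered.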

7.2 Model Development 

Data was collected at a 34 meter antenna site in Goldstone, Cali- 
fornia, in early 1991, under both normal and fault conditions. The 
two faults corresponded to a failed tachometer in the servo loop 
and a short circuit in the electronic compensation loop — these 
are two of the most problematic components in terms of reliabil- 
ity. The data consisted of 15000 labelled sample vectors for each 
fault, which was converted to 75 feature vectors per class. Data 
was collected on two separate occasions in this manner. Because 
the antenna is in a remote location and is not permanently instru- 
mented for servo component data acquisition, data collection in 
this manner is a time-consuming and expensive task. Hence, the
models were trained with relatively few data points per class. 

Experiments were carried out with both a feedforward multi-layer
neural network and a simple maximum-likelihood Gaussian
classifier. A general description of the neural network model used
is given in the Appendix. The neural network was chosen over
alternative classification models because of its ability to
approximate arbitrary decision boundaries in a relatively
non-parametric manner. In addition, by using a mean-square error
objective function, the outputs of the network can be used as
estimates of posterior class probabilities (Richard and Lippmann
[15] and Miller, Goodman and Smyth [16]). Based on cross-validation
results, a network with a single hidden layer of 12 units was chosen
as the working model. The networks were trained using a conjugate
gradient variation of the well known backpropagation method (Barnard
and Cole [18], Powell [19]). The Gaussian classifier used a separate,
diagonal covariance matrix for each class, where the components
consisted of maximum likelihood estimates. Using the full
covariance matrix was considered impractical given only 150 samples
per class in 12 dimensions. Components of the Markov transition
matrix A were estimated using a database of trouble reports which
are routinely collected at all antenna sites; see Appendix 3 for a
more detailed discussion.
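
A diagonal-covariance maximum-likelihood Gaussian classifier of the
kind described can be sketched as follows. The function names, class
labels and toy two-dimensional data are hypothetical; the actual
features were 12-dimensional.

```python
import math


def fit_diagonal_gaussian(samples):
    """Maximum-likelihood mean and per-dimension variance for one class."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(d)]
    var = [sum((s[k] - mean[k]) ** 2 for s in samples) / n
           for k in range(d)]          # ML estimate divides by n
    return mean, var


def log_likelihood(theta, mean, var):
    """log p(theta | class) under the diagonal Gaussian."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (x - mu) ** 2 / (2 * v)
               for x, mu, v in zip(theta, mean, var))


def classify(theta, models):
    """Return the class label with the largest likelihood."""
    return max(models, key=lambda label: log_likelihood(theta, *models[label]))


# Toy training data (hypothetical), one list of feature vectors per class.
models = {
    "normal": fit_diagonal_gaussian([[0.0, 1.0], [0.2, 1.2], [-0.2, 0.8]]),
    "fault": fit_diagonal_gaussian([[3.0, 1.0], [3.2, 1.1], [2.8, 0.9]]),
}
print(classify([0.1, 1.0], models), classify([2.9, 1.0], models))
```

With a diagonal covariance only 2 x 12 parameters per class are
estimated, compared with 12 means plus 78 covariance terms for a full
matrix, which is why the diagonal form is the more practical choice
with 150 samples per class. A feature with zero variance in the
training set would need special handling that this sketch omits.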

FIGS. 2A and 2B illustrate a system embodying the present
invention monitoring an antenna pointing system, including the
pointing system followed by the parameter estimation stage, which
is followed below by the parameter/state conditional probability
model. Finally, the conditional probability model is followed by the
Markov component, showing both past state estimates and current
instantaneous estimates being combined as in Equation (6). These
models were implemented in software as part of the data acquisition
system. The results of testing the models on previously unseen data
in real-time at the antenna site are discussed in the next section.

Referring now to FIGS. 2A and 2B, the measured observables
from the system being monitored (such as the rate commands,
tachometer readings and torque bias of the antenna pointing
system) are received by an on-line parameter estimator 10 of a
parameter estimation model 20. The parameter estimation model
20 compares a predicted observable (such as the motor output of
the antenna pointing system) predicted by the parameter estimator
10 with the actual measurement of that observable (such as
the actual measured motor output of the antenna pointing system)
to form an error signal, which is fed back to the parameter
estimator 10. From this, the parameter estimator 10 provides
estimated parameters during each successive sampling interval. The
estimated parameters may be, for example, statistical quantities
which reflect the amount of change in each observable. These
estimated parameters are then processed in a conventional classifier
30 such as a neural network providing a mapping between symptoms
(the estimated parameters) and classes (including the normal
condition state and various types of fault states). The classifier 30
provides instantaneous probability estimates of the states of the
system based upon the estimated parameters. These instantaneous
probability estimates are first transformed, using Bayes' rule, into
the observation probabilities required by the first term of
Equation 6. The transformed probabilities are then processed by a
Markov time correlation model 40 embodying the computation of
Equation 6. Specifically, at each successive sampling interval, the
Markov model 40 performs the hidden Markov model calculation
of Equations 5 and 6 to produce the posterior state probabilities
of the system states, and infers the true system state from the one
posterior state probability dominating the others. This inference
of the true system state is the system decision at time $t$ (the
current sampling interval). Thus, a sequence of hidden Markov model
calculations 50, 60, 70, and so forth are performed. As indicated in
FIG. 2, the results of each calculation 50, 60, 70, and so forth are
saved and used in the next calculation performed during the next
sampling interval. Thus, the calculation 60 performed during the
current sampling interval at time $t$ uses the results of the
calculation 50 performed during the previous sampling interval at
time $t-1$. Moreover, the results of the current calculation 60 are
used by the next calculation 70 performed at time $t+1$.

Each calculation 50, 60, 70, and so forth uses Equation 6 to
compute the intermediate probability of Equation 4 and then
employs the rule of Equation 5 to compute the posterior system
probabilities. The intermediate probability is equal to the
corresponding instantaneous probability of the one state multiplied
by a sum over plural states of the intermediate probability for a
given state computed during the previous sampling interval
multiplied by the transition probability between the given state and
the one state. Finally, the method is completed by computing from
the intermediate probability for each one of the states of the
current sampling interval the posterior probability that the system
is in the corresponding one of the states, and determining from the
posterior probabilities whether the system has transitioned to one
of the failure states and, if the system has transitioned to one of
the failure states, issuing an alarm corresponding thereto.

Defining plural transition probabilities includes observing a
mean time between failures (MTBF) characteristic of each of the
failure states and computing each corresponding transition
probability therefrom. Computing the corresponding transition
probability includes dividing the time period of the sampling intervals
by the MTBF and subtracting the resulting quotient from unity.
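
As a small illustration of this rule, the sketch below computes the
probability of remaining in a state over one sampling interval. The
function name and the numerical values are assumptions chosen for the
example; both times must be given in the same units.

```python
def self_transition_probability(interval_s, mtbf_s):
    """Probability of staying in a state for one sampling interval.

    Implements: unity minus (sampling interval / MTBF).
    """
    if mtbf_s <= interval_s:
        raise ValueError("MTBF must exceed the sampling interval")
    return 1.0 - interval_s / mtbf_s

# Illustrative values: a 4-second interval and a 4000-second MTBF.
p_stay = self_transition_probability(4.0, 4000.0)
```

With these illustrative values the quotient is 0.001, so the
probability of staying in the state is 0.999 per interval.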

Transforming the instantaneous posterior class probabilities into the
instantaneous probabilities of the features given each state is
accomplished using Bayes’ rule.

7.3 Classification Results

The neural and Gaussian models, both with and without the Markov
component, were tested by monitoring the antenna as it moved at
typical deep-space tracking rates of about 4 mdeg/second. The
results reported below are summaries over a variety of
different short tests; the cumulative monitoring time was about 1
hour.


                        Without Markov model     With Markov model
Class                   Gaussian     Neural      Gaussian     Neural
------------------------------------------------------------------
Normal Conditions           0.36       1.72          0.36       0.00
Tachometer Failure         27.78       0.00          2.38       0.00
Compensation Loss          34.21       0.00         43.16       0.00
All Classes                16.92       0.84         14.42       0.00

Table 1: Percentage misclassification rates for Gaussian and neural
models both with and without Markov component.

Table 1 summarizes the overall classification performance for
each of the models, both for each individual class and for
all classes averaged together. Clearly, from the final column, the
neural-Markov model is the best model in the sense that no
windows at all were misclassified. It is significantly better than the
Gaussian classifier, which performed particularly poorly under fault
conditions. Under normal conditions, however, the Gaussian classifier
was quite accurate, having only 1 false alarm during the roughly 30
minutes of time devoted to monitoring normal conditions. This is not
too surprising since, in theory at least, the ARX coefficients should
obey a multivariate Gaussian distribution given that the model is
correct, i.e., for the non-fault case (Ljung [14]). The Markov model
is clearly beneficial, in particular in reducing the
effects of isolated random errors. However, for the compensation
loss fault, the Markov model actually worsened the already poor
Gaussian model results, which is to be expected if the non-Markov
component is doing particularly poorly, as in this case.


                        Without Markov model     With Markov model
Class                   Gaussian     Neural      Gaussian     Neural
------------------------------------------------------------------
Normal Conditions          -2.44      -1.97         -2.46      -4.24
Tachometer Failure         -0.40      -3.52         -0.42      -4.22
Compensation Loss          -0.82      -3.48         -1.39      -4.71
All Classes                -0.87      -2.29         -1.02      -4.34

Table 2: Logarithm of Mean Squared Error for Gaussian and
neural models both with and without Markov component.


Table 2 presents the same data summarized in terms of the
logarithm (base 10) of the mean-square error (MSE), calculated as
follows:

$$ \mathrm{MSE} = \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{m} \left( \hat{p}(\omega_i(j)) - o_i(j) \right)^2 \tag{13} $$

where $\hat{p}(\omega_i(j))$ is the classifier’s estimate of the posterior
probability of class $i$ for input $j$, $o_i(j) = 1$ if $\omega_i$ is the true
class for input $j$ and zero otherwise, and $N$ is the size of the
training data set.

The mean-square error provides more information on the
probabilities being produced by the classifier than the classification
error rates. Lower values imply that the probabilities are sharper, i.e.,
the classifier is more certain in its conclusion. The general trend
in Table 2 is that the neural-Markov combination is significantly
better than any of the other combinations.
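
The quantity reported in Table 2 can be sketched as follows. The sketch
assumes that the squared differences are summed over all classes and
averaged over the N inputs; the function name, the array layout, and
the sample values are hypothetical and are not data from the
experiments.

```python
import numpy as np

def log10_mse(prob, true_class):
    """Base-10 logarithm of the mean-square error of class probabilities.

    prob       : (N, m) array, estimated probability of each class per input
    true_class : (N,) integer array, index of the true class per input
    """
    prob = np.asarray(prob, dtype=float)
    n, m = prob.shape
    # Indicator targets: 1 for the true class of each input, 0 otherwise.
    target = np.zeros((n, m))
    target[np.arange(n), true_class] = 1.0
    mse = np.sum((prob - target) ** 2) / n
    return np.log10(mse)

# Hypothetical probabilities for two inputs and two classes.
noisy = log10_mse([[0.9, 0.1], [0.2, 0.8]], [0, 1])
sharp = log10_mse([[0.99, 0.01], [0.01, 0.99]], [0, 1])
```

Both hypothetical classifiers label every input correctly, yet the
second scores a lower value because its probabilities are sharper,
which is the extra information the text attributes to this measure.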

FIGS. 3, 4, and 5 plot the estimated probability of the true class
as a function of time for various models to allow a more detailed
interpretation of the results. Note that, given that the true class
is labelled $i$, the estimated probability of class $i$ from the neural
network corresponds to the normalized output of output unit $i$ of
the network at time $t$, i.e.,

$$ \hat{p}(\omega_i \mid \theta(t)) = \frac{o_i(t)}{\sum_{k=1}^{m} o_k(t)} \tag{14} $$

(where $o_i(t)$ is the value of the $i$th network output node), while
the Markov probabilities correspond to the estimates of
$p(\Omega(t) = \omega_i \mid \Phi(t))$, as described earlier in Equation 6.
FIG. 3 corresponds to normal conditions and compares the neural
model with and without the Markov processing. The instantaneous
probability estimates from the neural model have a large
variation over time and are quite noisy. This is essentially due
to the variation in the sensor data from one window to the next,
since, as might be expected, signals such as motor current contain
significant noise. In addition, a large glitch is visible at about 460
seconds. The neural model gives a low probability that the
condition is normal for that particular window (in fact a large glitch
such as this looks like a tachometer failure problem); however, the
Markov model remains relatively unaffected by this single error.
Overall, the stability of the Markov model is clearly reflected in
this plot and has significant advantages in an operational
environment in terms of keeping the false alarm rate to a minimum. Note
that at any particular instant the neural network only ever assigns
a probability of up to 0.8 or 0.9 to the true class. In contrast, by
modelling the temporal context, the neural-Markov model assigns
a much greater degree of certainty to the true class.

FIG. 4 compares the performance of the Gaussian, Gaussian-Markov
and neural-Markov models on detecting the compensation
loss fault. The variation in the Gaussian estimates is quite
noticeable. The Gaussian-Markov model combination, after some
initial uncertainty for the first 90 or so seconds, settles down to
yield reasonable estimates. However, the overall superiority of the
neural-Markov model (the upper curve) is evident.

FIGS. 5A through 5C and FIGS. 6A through 6C show the performance
of the neural network classifier without and with the hidden
Markov model, respectively, while monitoring the antenna for a
total duration of about 1 hour. The tachometer failure and the
compensation loss fault are introduced into the system after 14
minutes and 44 minutes respectively, each lasting roughly 15 minutes.
The difference in the quality of the two approaches is clearly visible
in the figures and leaves little doubt as to the utility of the Markov
method.

The results presented above clearly demonstrate the ability of a
hidden Markov model to enhance the overall quality and reliability
of a monitoring system's decisions. From a practical standpoint,
the difference is significant: the non-Markov systems would not
be reliable for actual operational use since they are too noisy and
would have an unacceptably large false alarm rate. In contrast, the
Markov-based system is a serious candidate for field
implementation, particularly for installation in all new antenna
designs. However, there are significant opportunities for further
improvement in models of this nature.

8 Detecting Novel Classes 

While the neural model described above exhibits excellent
performance in terms of discrimination, there is another aspect of
classifier performance which must be considered for applications of this
nature: how will the classifier respond if presented with data from
a class which was not included in the training set? Ideally, one
would like the model to detect this situation. For fault diagnosis
the chance that one will encounter such novel classes under
operational conditions is quite high since there is little hope of having
an exhaustive library of faults to train on.

In general, with any non-parametric learning algorithm, there
can be few guarantees about the extrapolation behavior of the
resulting model (Geman, Bienenstock and Doursat [20]). The
response of the trained model to a point far away from the training
data may be somewhat arbitrary, since it may lie on either side of
a decision boundary, the location of which in turn depends on a
variety of factors such as initial conditions for the training algorithm,
objective function used, particular training data, and so forth. One
might hope that for a feedforward multi-layer perceptron, novel
input vectors would lead to low response for all outputs. However, if
neural activation units with non-local response functions are used
in the model (such as the commonly used sigmoid function), the
tendency of training algorithms such as backpropagation is to
generate mappings which have a large response for at least one of the
classes as the attributes take on values which extend well beyond
the range of the training data values. Kramer and Leonard [21]
discuss this particular problem of poor extrapolation in the context of
fault diagnosis of a chemical process plant. The underlying
problem lies in the basic nature of discriminative models, which focus
on estimating decision boundaries based on the differences between
classes. In contrast, if one wants to detect data from novel classes,
one must have a generative model for each known class, namely one
which specifies how the data is generated for these classes. Hence,
in a probabilistic framework, one seeks estimates of the probability
density function of the data given a particular class,
$f(\theta \mid \Omega = \omega_i)$,
from which one can in turn use Bayes’ rule for prediction:


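$$ p(\Omega = \omega_i \mid \theta) = \frac{f(\theta \mid \Omega = \omega_i)\, p(\Omega = \omega_i)}{\sum_{k=1}^{m} f(\theta \mid \Omega = \omega_k)\, p(\Omega = \omega_k)} \tag{15} $$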
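
The prediction rule above can be illustrated with a short sketch. It
assumes, purely for the example, one-dimensional Gaussian class
densities with made-up means, widths and priors; the function names are
hypothetical. Besides the posteriors, the sketch returns the
denominator of the rule, which is the overall density of the data under
the known classes.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian (assumed class density for this sketch)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def class_posteriors(x, means, stds, priors):
    """Bayes' rule over known classes, given per-class generative densities.

    Returns the posterior class probabilities and the total density at x.
    """
    likelihoods = gaussian_pdf(x, np.asarray(means), np.asarray(stds))
    joint = likelihoods * np.asarray(priors)   # density times prior, per class
    evidence = joint.sum()                     # denominator of Bayes' rule
    return joint / evidence, evidence

# Two hypothetical classes centred at 0 and 3 with equal priors.
post_near, dens_near = class_posteriors(0.0, [0.0, 3.0], [1.0, 1.0], [0.5, 0.5])
post_far, dens_far = class_posteriors(10.0, [0.0, 3.0], [1.0, 1.0], [0.5, 0.5])
```

For the far-away point the posteriors still sum to one and strongly
favour one class, but the total density is many orders of magnitude
smaller than for the nearby point. Comparing that density against a
threshold is one possible way to flag data from a novel class; the
choice of threshold is an assumption outside the text.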


Generative models have certain disadvantages: they can per- 
form poorly in high dimensions, and for a fixed amount of data 
may not be as efficient in terms of approximating the Bayes deci- 
sion boundary as a purely discriminative method. 


9 Discussion 

The hidden Markov method for on-line health monitoring proposed
in this specification relies on certain key assumptions which may
or may not be true for particular applications. In particular, for
the purposes of this discussion it is assumed that:

1. Faults are discrete in nature (i.e., they are “hard” failures
rather than gradual degradation) and are known in advance.

2. There is a fault library of classified data (for some embodiments
of the present invention) in order to train the model.

3. Symptom estimates are statistically independent from one
window to the next, conditioned on the classes.

However, it should be pointed out that these assumptions could
potentially be relaxed and the model further refined. For example,
a fault library may not be necessary if the symptom-fault
dependence can be specified based on prior knowledge. Similarly, the
assumption of independence of symptom estimates across windows
is not strictly necessary; it makes the model much simpler, but
dependence across windows could be included in Equation 6 if it is
known to exist and can be modelled.


10 Conclusion 


Effective modelling of temporal context in continuous monitoring 
applications can considerably improve the reliability and accuracy 
of a decision system. In particular, it has been shown in this spec- 
ification that hidden Markov models provide an effective method 
for incorporating temporal context in conjunction with traditional 
classification methods. The Markov model approach has the abil- 
ity to significantly reduce the false alarm rate of a classification 
system by taking advantage of any time domain redundancy which 
may be present. The model was demonstrated on a real-world an- 
tenna fault diagnosis problem — the empirical results demonstrate 
clearly the advantage of the Markov approach. In general, the use 
of hidden Markov models for continuous monitoring seems to have 
promise: applications in other critical domains such as medical
diagnosis in intensive care situations, nuclear plant monitoring,
and so forth, appear worthy of further investigation.

While the invention has been described in detail with reference 
to preferred embodiments, it is understood that variations and 
modifications thereof may be made without departing from the 
true spirit and scope of the invention. 
Appendix 1: Neural Network Model Description

The following is a description of an example of a popular
feedforward multi-layer neural network model to familiarize the reader
with the general notation and concepts. FIG. 7 shows an example
of such a neural network. The input nodes are labelled $n_i$,
$1 \le i \le K+1$, the hidden nodes are labelled $h_j$, $1 \le j \le H$,
and the output nodes are labelled $o_k$, $1 \le k \le m$. In general,
there are $K+1$ input units, where $K$ is the number of features. The
extra node is always in the “on” state, providing a threshold
capability. Similarly, there are $m$ output nodes, where $m$ is the
number of classes.

The number of hidden units H in the hidden layer can influence
the classifier performance in the following manner: too many and
the network overfits the data, whereas too few hidden units leaves
the network with insufficient representational power. The
appropriate network size is typically chosen by varying the number of
hidden units and observing cross-validation performance.

Each input unit $i$ is connected to each hidden unit $j$ by a link
with weight $w_{ij}$, and each hidden unit $j$ is connected to each
output unit $k$ by a weighted link $w_{jk}$. Each hidden unit calculates
a weighted sum and passes the result through a non-linear function
$F()$, i.e.,

$$ a(h_j) = F\left( \sum_{i=1}^{K+1} w_{ij}\, a(n_i) \right) $$

where $a(n_i)$ is the activation of input unit $i$; typically, this is just
a linear (scaled) function of the input feature. A commonly used
non-linear function in the hidden unit nodes $F(x)$ is the so-called
sigmoid function, defined as

$$ F(x) = \frac{1}{1 + e^{-x}} $$

Output unit $k$ calculates a similar weighted sum using the
weights $w_{jk}$ between the $j$th hidden unit and the $k$th output unit,
i.e.,

$$ a_k = G\left( \sum_{j=1}^{H} w_{jk}\, a(h_j) \right) $$

where $a_k$ is the activation of the $k$th output node. The function
$G(x)$ can be chosen either as linear (e.g., $G(x) = x$) or as a
non-linear function. For example, for a classification problem such as
that described in this specification the sigmoid function is used to
restrict the range of the output activations to the range [0,1]. A
classification decision is made by choosing the output unit with
the largest activation for a given set of inputs (feature values); i.e.,
choose class $k$ such that

$$ k = \arg\max_i \{ a_i \} $$
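
The forward computation and classification decision described in this
appendix can be sketched as below. The function names, the weight
array shapes, and the choice of a sigmoid for the output function are
assumptions made for illustration; the weights in the usage lines are
made up and are not trained values.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid non-linearity used at the hidden (and here output) units."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(features, w_in_hidden, w_hidden_out):
    """Forward pass of a single-hidden-layer network.

    features     : (K,) feature values
    w_in_hidden  : (K + 1, H) weights from input units to hidden units
    w_hidden_out : (H, m) weights from hidden units to output units
    """
    # Append the extra input node that is always "on" (threshold unit).
    a_in = np.append(np.asarray(features, dtype=float), 1.0)
    # Hidden activations: non-linear function of the weighted sum of inputs.
    a_hidden = sigmoid(a_in @ w_in_hidden)
    # Output activations, restricted to [0, 1] by the sigmoid.
    a_out = sigmoid(a_hidden @ w_hidden_out)
    # Decision: the output unit with the largest activation.
    return a_out, int(np.argmax(a_out))

# Made-up weights: K = 2 features, H = 3 hidden units, m = 2 classes.
w1 = np.zeros((3, 3))
w2 = np.array([[1.0, -1.0], [1.0, -1.0], [1.0, -1.0]])
outputs, chosen_class = forward([0.2, -0.4], w1, w2)
```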

The network design problem is then to find the best set of
weights such that a particular objective function is minimized on
the $N$ training data samples. The training data is in the form of
input-output pairs $\{x_j, y_j\}$, $1 \le j \le N$, where $x_j$ is a
feature vector and $y_j$ is the desired output. (For simplicity of
notation assume that there is only a single output model.) Let
$\hat{y}_j(\Omega, x_j)$ be the network output for a particular set of
weights $\Omega$ and input vector $x_j$. The objective function is
typically some metric on $y_j$ and $\hat{y}_j$, whose mean value is
estimated on the training data. Commonly used objective functions
include the mean-squared error

$$ E_{MSE} = \frac{1}{N} \sum_{j=1}^{N} \left( y_j - \hat{y}_j(\Omega, x_j) \right)^2 $$

and the cross-entropy error

$$ E_{CE} = \frac{1}{N} \sum_{j=1}^{N} \left[ y_j \log \frac{y_j}{\hat{y}_j(\Omega, x_j)} + (1 - y_j) \log \frac{1 - y_j}{1 - \hat{y}_j(\Omega, x_j)} \right] $$

From a maximum likelihood perspective the mean-squared er- 
ror approach essentially assumes that the training data is perturbed 
by additive Gaussian noise, while the cross-entropy function as- 
sumes a multinomial distribution on the class labels. Despite these 
significantly different assumptions, for classification problems there 
appears to be little practical difference in terms of classification 
performance between these objective functions. For the experi- 
ments reported in this specification the mean-squared error objec- 
tive function was used. 
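
The two objective functions can be compared numerically with a short
sketch. It assumes a single output, the relative-entropy form of the
cross-entropy error (with the convention that 0 log 0 = 0), and clips
the network outputs away from 0 and 1 to keep the logarithms finite;
the function names and the clipping constant are choices made here.

```python
import numpy as np

def mse_objective(y, y_hat):
    """Mean-squared error between desired and actual outputs."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

def cross_entropy_objective(y, y_hat, eps=1e-12):
    """Cross-entropy error in relative-entropy form, single output."""
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)

    def term(t, p):
        # t * log(t / p), taking the value 0 wherever t == 0.
        safe_t = np.where(t > 0, t, 1.0)
        return np.where(t > 0, t * np.log(safe_t / p), 0.0)

    return np.mean(term(y, y_hat) + term(1.0 - y, 1.0 - y_hat))

# Hypothetical desired outputs and network outputs for two samples.
e_mse = mse_objective([1.0, 0.0], [0.9, 0.2])
e_ce = cross_entropy_objective([1.0, 0.0], [0.9, 0.2])
```

For 0/1 desired outputs the relative-entropy form reduces to the
familiar negative log-likelihood of the correct label.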
Appendix 2: Description of the Antenna Pointing System

FIG. 2A includes a block diagram of the elevation axis antenna
drive subsystem (there is a corresponding azimuth axis drive for
positioning the antenna in the azimuth axis). The elevation drive
subsystem is a closed-loop control system that consists of a
digital control computer, two 7.5 horsepower direct current motors,
two servo amplifiers, two cycloid gear reducers, two tachometers,
and various electronic components for signal conditioning and servo
compensation. The two forward tachometer/amplifier/motor/gear
paths operate in tandem to drive a large bull gear which is
attached to the antenna structure (a 34m dish plus supporting metal
structure). Feedback control is provided by both rate feedback
from each motor to its tachometer and a position feedback loop.
The antenna position is estimated by an optical encoder and fed
back to the antenna servo controller. The antenna servo controller
is a microprocessor-based system which implements a PI
(proportional plus integral) control algorithm by integrating both the
commanded position (which is a digital signal sent from a ground
station control computer describing the desired position) and the
actual position estimate. The digital portion of the control loop (the
antenna servo controller) updates at a 50Hz rate. The
reconstruction filter and the loop compensation components are filters for
signal conditioning and control loop compensation. Finally, the
torque bias signal is a voltage measurement proportional to load
torque which is fed back from the gears in order to share the torque
between the two motors, reduce the effect of parameter variations
between them and to effectively bias the cycloid gears away from
non-linear regions of operation.


Appendix 3: Specification of the Markov Transition Matrix
for the Antenna Pointing Problem

Training and test data under fault conditions were obtained by
switching faulty components in and out of the servo control loop.
Hence, for the purposes of this experiment, the two fault conditions
were modelled as intermittent faults and fault transitions between
these two states were allowed. The Markov transition matrix A
was set as follows:

$$ A = \begin{pmatrix} 0.999 & 0.005 & 0.005 \\ 0.0005 & 0.99 & 0.005 \\ 0.0005 & 0.005 & 0.99 \end{pmatrix} $$


This corresponds to a system MTBF of about 1 hour and 7 minutes
given the 4 second decision interval. It also assumes that each fault
is equally likely to occur and that the mean duration of each fault
is about 6 minutes and 40 seconds. The initial state probabilities
were chosen to be equally likely:

$$ \pi(0) = (1/3,\ 1/3,\ 1/3). $$
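
The two durations quoted above can be checked with a short worked
calculation. It assumes that the time spent in a state is geometrically
distributed, so that the mean holding time is the decision interval
divided by one minus the self-transition probability; the symbols $T$
(the 4 second decision interval), $a_{11}$ (the self-transition
probability of the normal state, 0.999) and $a_{22}$ (that of a fault
state, 0.99) are labels introduced here for the illustration.

$$ \frac{T}{1 - a_{11}} = \frac{4\ \mathrm{s}}{1 - 0.999} = 4000\ \mathrm{s} \approx 66.7\ \mathrm{min}, \qquad \frac{T}{1 - a_{22}} = \frac{4\ \mathrm{s}}{1 - 0.99} = 400\ \mathrm{s} = 6\ \mathrm{min}\ 40\ \mathrm{s}. $$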

The actual MTBF of the system under operational conditions
was estimated from a problem database to be about 30 hours if
only hard faults are considered. However, if intermittent
transient faults are also considered, the MTBF is effectively reduced to
about 1 hour; this estimate is based on empirical observations
of the antenna in an operational tracking mode. Hence, while the
self-transition probabilities of the fault states are set in a
somewhat artificial manner for this experiment, the value chosen for
$a_{11}$ correlates well with the effective MTBF of the system.

As mentioned previously herein, the state estimates of the
model are relatively robust to changes in the values of the
transition probabilities. For example, increasing $1 - a_{11}$ by an order of
magnitude causes the estimates to be slightly less stable but does
not introduce any additional false alarms, while reducing $1 - a_{11}$ by
an order of magnitude causes no significant difference in the results
other than that the time for the model to switch from normal to a fault
state (after a fault has actually occurred) increases from a single
4-second interval to 2 or 3 such intervals. It should be pointed out
that the robustness of the method in general to misspecification
errors in the transition matrix is a topic for further investigation.

The geometric distribution was found to be a reasonable fit for
the distribution of durations between failures, thus validating the
first-order Markov assumption.
Appendix 4: Kernel Density Estimation

Unless one assumes a particular parametric form for the density, it
must somehow be estimated from the data. The multi-class nature
of the problem is now ignored temporarily in favor of a single-class
case. The present description focuses on the use of kernel-based
methods. Consider the 1-dimensional case of estimating the
density $f(x)$ given samples $\{x_i\}$, $1 \le i \le N$. The idea is simple
enough: an estimate $\hat{f}(x)$ is obtained, where $x$ is the point at which
the density must be found, by summing the contributions of the
kernel $K((x - x_i)/h)$ (where $h$ is the bandwidth of the estimator, and
$K(.)$ is the kernel function) over all the samples and normalizing
such that the estimate is itself a density, i.e.,

$$ \hat{f}(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\left( \frac{x - x_i}{h} \right) $$

The estimate $\hat{f}(x)$ directly inherits the properties of $K(.)$, hence it
is common to choose the kernel shape itself to be some well-known
smooth function, such as a Gaussian. For the multi-dimensional
case, the product kernel is commonly used:

$$ \hat{f}(\mathbf{x}) = \frac{1}{N h_1 \cdots h_d} \sum_{i=1}^{N} \prod_{k=1}^{d} K\left( \frac{x^k - x_i^k}{h_k} \right) $$

where $x^k$ denotes the component in dimension $k$ of vector $\mathbf{x}$, and
the $h_k$ represent different bandwidths in each dimension.
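
A product-kernel estimate of this kind can be sketched as follows,
assuming a Gaussian kernel in every dimension. The function names and
array shapes are hypothetical; the one-dimensional estimate is simply
the case d = 1.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel (an assumed choice of kernel shape)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def product_kernel_density(x, samples, bandwidths):
    """Product-kernel density estimate at a single point.

    x          : (d,) point at which the density is wanted
    samples    : (N, d) stored training samples
    bandwidths : (d,) one bandwidth per dimension
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    samples = np.asarray(samples, dtype=float).reshape(len(samples), -1)
    h = np.atleast_1d(np.asarray(bandwidths, dtype=float))
    # Scaled distance from x to every stored sample, per dimension: (N, d).
    u = (x - samples) / h
    # Product over dimensions gives each sample's contribution.
    contributions = np.prod(gaussian_kernel(u), axis=1)
    # Normalise so that the estimate is itself a density.
    return contributions.sum() / (len(samples) * np.prod(h))
```

Every call touches all N stored samples, which is the storage and
computation cost the appendix identifies as a drawback of the full
kernel model.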

Various studies have shown that the quality of the estimate
is typically much more sensitive to the choice of the bandwidth h
than it is to the kernel shape K(.). Cross-validation techniques are
usually the best method to estimate the bandwidths from the data,
although this can be computationally intensive and the resulting
estimates can have a high variance across particular data sets. A
significant disadvantage of kernel models is the fact that all training
data points must be stored and a distance measure between a new
point and each of the stored points must be calculated for each
class prediction. Another less obvious disadvantage is the lack of
empirical results and experience with using these models for
real-world applications; in particular there is a dearth of results for
high-dimensional problems. In this context, a kernel approximation
model is described which is considerably simpler both to train and
implement than the full kernel model.
Appendix 5: Kernel Approximation using Mixture Densities

An obvious simplification to the full kernel model is to replace
clusters of data points by representative centroids, to be referred to as
the centroid kernel model. Intuitively, the sum of the responses
from a number of kernels is approximated by a single kernel of
appropriate width. Algorithms for bottom-up merging of data points
for problems of this nature have been proposed. Here, however, a
top-down approach is followed by observing that the kernel
estimate is itself a special case of a mixture density. The underlying
density is assumed to be a linear combination of $L$ mixture
components, i.e.,

$$ f(x) = \sum_{i=1}^{L} \alpha_i f_i(x) $$

where the $\alpha_i$ are the mixing proportions. The full kernel
estimate is itself a special case of a mixture model with $\alpha_i = 1/N$
and $f_i(x) = K(x)$. Hence, the centroid kernel model can also be
treated as a mixture model but now the parameters of the
mixture model (the mixing proportions or weights, and the widths
and locations of the centroid kernels) must be estimated from the
data. There is a well-known and fast statistical procedure known
as the EM (Expectation-Maximization) algorithm for iteratively
calculating these parameters, given some initial estimates. Hence,
the procedure for generating a centroid kernel model is
straightforward: divide the training data into homogeneous subsets according
to class labels and then fit a mixture model with $L$ components to
each class using the EM procedure (initialization can be based on
randomly selected prototypes). Prediction of class labels then
follows directly from Bayes’ rule. Note that there is a strong similarity
between mixture/kernel models and Radial Basis Function (RBF)
networks. However, unlike the RBF models, the user does not
train the output layer of the network in order to improve
discriminative performance as this would potentially destroy the desired
probability estimation properties of the model.
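
The fitting step for one class can be sketched with a compact EM loop.
The sketch assumes one-dimensional data and Gaussian components; the
function names, the fixed iteration count, the variance floor, and the
optional init_means argument are choices made here for illustration
rather than details taken from the text.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian component."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def fit_mixture_em(x, n_components, n_iter=100, init_means=None, seed=0):
    """Fit mixing proportions, locations and widths of a 1-D mixture by EM."""
    x = np.asarray(x, dtype=float)
    if init_means is None:
        # Initialise the centroids from randomly selected prototypes.
        rng = np.random.default_rng(seed)
        init_means = rng.choice(x, size=n_components, replace=False)
    means = np.array(init_means, dtype=float)
    stds = np.full(n_components, x.std())
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point, (N, L).
        dens = weights * gaussian_pdf(x[:, None], means, stds)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate proportions, locations and widths.
        totals = resp.sum(axis=0)
        weights = totals / len(x)
        means = (resp * x[:, None]).sum(axis=0) / totals
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / totals
        stds = np.sqrt(np.maximum(variances, 1e-6))  # floor avoids collapse
    return weights, means, stds

def mixture_density(x, weights, means, stds):
    """Evaluate the fitted mixture density at a scalar point x."""
    return float(np.sum(weights * gaussian_pdf(x, means, stds)))
```

Fitting one such mixture per class and inserting the resulting
densities, together with the class priors, into Bayes’ rule gives the
class predictions; only the L centroids per class need to be stored
rather than every training point.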





HIDDEN MARKOV MODELS FOR FAULT 
DETECTION IN DYNAMIC SYSTEMS 


ABSTRACT OF THE DISCLOSURE 

The invention is a system failure monitoring method and
apparatus which learns the symptom-fault mapping directly from
training data. The invention first estimates the state of the system at
discrete intervals in time. A feature vector $x$ of dimension $k$ is
estimated from sets of successive windows of sensor data. A
pattern recognition component then models the instantaneous
estimate of the posterior class probability given the features,
$p(\omega_i \mid x)$, $1 \le i \le m$. Finally, a hidden Markov model is
used to take advantage of temporal context and estimate class
probabilities conditioned on recent past history. In this hierarchical
pattern of information flow, the time series data is transformed and
mapped into a categorical representation (the fault classes) and
integrated over time to enable robust decision-making.